Address Normalization


In [22]:
from nltk.corpus import stopwords
import string
from transform.normalizer import *
from transform.parser import *
from match.match import *
import inspect
import jellyfish
from retrieve.search import *

First, let's read the data that we're going to use to normalize and parse the addresses:


In [7]:
punctuation = set(string.punctuation)
language = 'portuguese'

prefix_file = '../data/prefixes.csv'

with open(prefix_file, 'r') as g:
    prefixes = g.read().splitlines()
address_prefixes = prefixes

stopw = stopwords.words(language)

  • punctuation is the set of punctuation characters that we want to remove.
  • address_prefixes are the common address prefixes (such as "rua") that we want to remove.
  • stopw are the common Portuguese stopwords that we also want to remove.
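
To get a feel for these inputs, we can peek at a few entries (the exact stopword order depends on your NLTK data, so treat these as examples):

    print(sorted(punctuation)[:5])  # e.g. ['!', '"', '#', '$', '%']
    print(stopw[:5])                # e.g. ['de', 'a', 'o', 'que', 'e']
    print(address_prefixes[:3])     # first prefixes from prefixes.csv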

In [28]:
address = "Rua XV de Novembro, 123 bloco 23 A"

normalized_address = normalize_address(
    address, punctuation, stopw, address_prefixes)
print("Normalized address: ", normalized_address)


Normalized address:  xv novembro 123 bloco 23

So what is happening here? Let's look at what normalize_address does:


In [15]:
inspect.getsourcelines(normalize_address)


Out[15]:
(['def normalize_address(input_string, punctuation, stopwords, prefixes):\n',
  '    pipeline = [\n',
  '        transform_encoding,\n',
  '        transform_case,\n',
  '        partial(remove_punctuation, punctuation=punctuation),\n',
  '        partial(remove_stopwords, stopwords=stopwords),\n',
  '        partial(remove_address_prefixes, address_prefixes=prefixes)\n',
  '    ]\n',
  '    return reduce((lambda value, func: func(value)), pipeline, input_string)\n'],
 41)

So we are doing several operations in sequence:

  • transform_encoding
  • transform_case
  • remove_punctuation
  • remove_stopwords
  • remove_address_prefixes

Each function is applied to the result of the previous one: reduce threads the input string through the whole pipeline.
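
A minimal standalone illustration of the same pattern, with toy steps in place of the real normalizer functions:

    from functools import reduce

    def transform_case(value):
        # Toy stand-in for the real step: lowercase everything.
        return value.lower()

    def strip_commas(value):
        # Toy stand-in: drop commas.
        return value.replace(',', '')

    pipeline = [transform_case, strip_commas]

    # reduce feeds each function's output into the next one.
    print(reduce(lambda value, func: func(value), pipeline, 'Rua XV de Novembro, 123'))
    # rua xv de novembro 123

So what's next?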

Parsing the address

After we normalized the address we want to parse it, selecting the relevant parts. We can do that with Regex or Named Entity Recognition. First, let's try to use regular expressions:


In [24]:
parsed_address = parse_address(normalized_address)
print(parsed_address)


{'street': 'xv novembro', 'complement': 'bloco 23', 'number': '23124'}

So how are we doing that?


In [17]:
inspect.getsourcelines(parse_address)


Out[17]:
(['def parse_address(input_string):\n',
  "    matched = re.findall(r'^(\\S+\\D*?)\\s*(\\d+)|(\\S.*)', input_string)\n",
  '    clean = list(filter(None, [e for l in matched for e in l]))\n',
  '    return {\n',
  "        'street': clean[0],\n",
  "        'number':clean[1],\n",
  "        'complement':clean[2]\n",
  '    }\n'],
 4)

That's the regular expression: ^(\S+\D*?)\s*(\d+)|(\S.*).

It means:

1st alternative: ^(\S+\D*?)\s*(\d+)

  • ^ asserts position at the start of the string.
  • 1st capturing group (\S+\D*?): \S+ matches any non-whitespace character ([^\r\n\t\f ]) one or more times, as many times as possible (greedy); \D*? matches any non-digit character ([^0-9]) zero or more times, as few times as possible (lazy).
  • \s* matches any whitespace character ([\r\n\t\f ]) zero or more times, as many times as possible (greedy).
  • 2nd capturing group (\d+): \d+ matches a digit ([0-9]) one or more times, as many times as possible (greedy).

2nd alternative: (\S.*)

  • 3rd capturing group (\S.*): \S matches any non-whitespace character ([^\r\n\t\f ]); .* matches any character except newline, zero or more times, as many times as possible (greedy).

Wow, that's very hard to understand.
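
To make it more concrete, here is what the regex produces on a sample input (illustrative, not from the notebook run): re.findall returns one tuple per match, with empty strings for the groups of the alternative that did not fire, and flattening plus filtering those tuples yields the three parts:

    import re

    matched = re.findall(r'^(\S+\D*?)\s*(\d+)|(\S.*)', 'rua teste 42 apto 7')
    print(matched)
    # [('rua teste', '42', ''), ('', '', 'apto 7')]

    clean = list(filter(None, [e for l in matched for e in l]))
    print(clean)
    # ['rua teste', '42', 'apto 7']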

But now we have our address normalized and separated into its components. We can now try to match it against the canonical source.

Match

The previous steps did not correct misspellings or other errors. If we have a canonical database, we can try to reduce those errors by transforming each address into its canonical form. For that, we have to match the normalized address against our reference database. First, how do we decide that two addresses are similar?

Similarity

We can compute a similarity between two strings, and there are several algorithms to do that. We will use the Jaro-Winkler similarity; other options include the Jaro, Levenshtein, and Hamming distances.
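
The similarity function used below comes from match.match; presumably it wraps an implementation like the one in jellyfish, which we imported above (an assumption, the notebook doesn't show it). Using jellyfish directly:

    import jellyfish

    # Jaro-Winkler similarity is 1.0 for identical strings and gives extra
    # weight to a shared prefix, which suits names with typos near the end.
    # (Older jellyfish releases call this function jaro_winkler.)
    print(jellyfish.jaro_winkler_similarity('novembro', 'novenbro'))

    # An edit distance such as Levenshtein counts edit operations instead;
    # here a single substitution, so the distance is 1.
    print(jellyfish.levenshtein_distance('novembro', 'novenbro'))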

Candidates for match

How can we retrieve candidates to match from our canonical database?

There are a few approaches:

  • Brute force (all against all)
  • Search by field

We will try to search for candidates and do a match with them.
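
The create_schema, create_index and search helpers come from retrieve.search, whose implementation is not shown here. Assuming they wrap a full-text search library such as Whoosh (an assumption on my part, not something the notebook confirms), a minimal field-indexed search could look like this:

    import os
    from whoosh.fields import Schema, TEXT, STORED
    from whoosh.index import create_in
    from whoosh.qparser import QueryParser

    # Index the street as searchable text; store the other fields verbatim.
    schema = Schema(street=TEXT(stored=True), complement=STORED, number=STORED)

    os.makedirs('indexdir', exist_ok=True)
    ix = create_in('indexdir', schema)

    writer = ix.writer()
    writer.add_document(street='XV de novembro', complement='bloco 22', number='123')
    writer.add_document(street='XV de novembro', complement='bloco 23 A', number='123')
    writer.commit()

    # Search a single field: parse the query against 'street' only.
    with ix.searcher() as searcher:
        query = QueryParser('street', ix.schema).parse('novembro')
        for hit in searcher.search(query):
            print(hit.fields())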


In [30]:
schema = create_schema()
idx = create_index(schema, 'indexdir')
results = search(parsed_address['street'], 'street', idx)
print(results)


[9, 12]
[{'street': 'XV de novembro', 'complement': 'bloco 22', 'city': 'São Paulo', 'number': 123, 'cep': '02837-223'}, {'street': 'XV de novembro', 'complement': 'bloco 23 A', 'city': 'São Paulo', 'number': 123, 'cep': '02837-223'}]

So now we have our candidates! But some of the other fields differ from what we have in our address. Is this a match?

Match

Let's devise a way to match these two addresses:


In [29]:
print(address)


Rua XV de Novembro, 123 bloco 23 A

We have some prior information about how addresses are structured and about which parts are more important than others. We can devise a matching algorithm with a linear regression, for example, whose weights encode the knowledge that street names are more important than complements:
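
As a rough sketch of that idea, here is a hypothetical weighted score (the weights are made up for illustration; in practice they could be learned from labeled address pairs with a regression):

    # Hypothetical weights encoding our prior: the street name matters most.
    WEIGHTS = {'street': 0.6, 'number': 0.3, 'complement': 0.1}

    def match_score(parsed, candidate, weights=WEIGHTS):
        # Weighted sum of per-field similarities.
        return sum(
            weight * similarity(str(parsed[field]), str(candidate[field]))
            for field, weight in weights.items()
        )

    best = max(results, key=lambda candidate: match_score(parsed_address, candidate))

Before combining anything, though, let's compare the fields one by one: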


In [33]:
similarity(parsed_address['street'],results[0]['street'] )


Out[33]:
0.7462722462722463

In [34]:
similarity(parsed_address['street'],results[1]['street'] )


Out[34]:
0.7462722462722463

So the similarity of the street name is exactly the same. Let's compare the numbers:


In [36]:
similarity(str(parsed_address['number']),str(results[0]['number'] ))


Out[36]:
0.6888888888888888

In [37]:
similarity(str(parsed_address['number']),str(results[1]['number'] ))


Out[37]:
0.6888888888888888

Oops, still the same. Let's move on to the complements:


In [38]:
similarity(parsed_address['complement'],results[0]['complement'] )


Out[38]:
0.95

In [39]:
similarity(parsed_address['complement'],results[1]['complement'] )


Out[39]:
0.96

OK, there's a small difference, but we can work with that! The second canonical address is a better match than the first one:


In [59]:
print("Original Address:", address)
print("Canonical Address:", str(results[1]['street']) + ', ' + str(results[1]['number']) + ' ' + str(results[1]['complement']))


Original Address: Rua XV de Novembro, 123 bloco 23 A
Canonical Address: XV de novembro, 123 bloco 23 A